Vector space representation of language probabilities through SVD of n-gram matrix

Authors

  • Shiro Terashima
  • Kazuya Takeda
  • Fumitada Itakura
Abstract

In this paper we introduce a vector space representation of the N-gram language model in which K-dimensional vectors are assigned to both words and contexts (i.e., (N−1)-word sequences), so that the scalar product of a ‘word vector’ and a ‘context vector’ gives the corresponding N-gram probability. The vector space representation is obtained from the singular value decomposition (SVD) of the co-occurrence frequency matrix (CFM) of contexts and words. The effectiveness of the proposed representation is examined by measuring how far the number of N-gram parameters can be reduced through clustering and truncation of the dimensions defined on the resulting vector space. The experimental results confirm that the number of model parameters can be reduced to less than 17.5% of the original, and that the proposed method is more effective than the word clustering method based on mutual information.
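The construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 3×3 count matrix is hypothetical, and the split of the singular values between the two factors (each side scaled by the square root) is one common convention for deriving word and context vectors from an SVD.

```python
import numpy as np

# Hypothetical co-occurrence frequency matrix (CFM):
# rows = contexts (N-1 word histories), columns = words; entries are counts.
F = np.array([
    [4.0, 1.0, 0.0],
    [1.0, 3.0, 2.0],
    [0.0, 2.0, 5.0],
])

# SVD of the CFM: F = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(F, full_matrices=False)

# Truncate to the top K singular dimensions.
K = 2
ctx_vecs  = U[:, :K] * np.sqrt(s[:K])      # one K-dim vector per context
word_vecs = Vt[:K, :].T * np.sqrt(s[:K])   # one K-dim vector per word

# The scalar product of a context vector and a word vector approximates the
# co-occurrence count; normalizing by the context's total count turns it
# into an (approximate) N-gram probability.
approx_count = ctx_vecs @ word_vecs.T
approx_prob = approx_count / F.sum(axis=1, keepdims=True)
print(np.round(approx_prob, 3))
```

Because the product of the truncated factors is exactly the rank-K SVD approximation of the CFM, truncating K is what trades model size against the fidelity of the recovered probabilities.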

Similar articles

G-Frames, g-orthonormal bases and g-Riesz bases

G-frames in Hilbert spaces are a redundant set of operators which yield a representation for each vector in the space. In this paper we investigate the connection between g-frames, g-orthonormal bases and g-Riesz bases. We show that a family of bounded operators is a g-Bessel sequence if and only if the Gram matrix associated to it defines a bounded operator.


Compression of Breast Cancer Images By Principal Component Analysis

The principle of dimensionality reduction with PCA is the representation of the dataset ‘X’ in terms of eigenvectors eᵢ ∈ ℝᴺ of its covariance matrix. The eigenvectors oriented in the direction with the maximum variance of X in ℝᴺ carry the most relevant information of X. These eigenvectors are called principal components [8]. Ass...
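The PCA principle summarized in this snippet can be sketched in a few lines. The data below is a toy random matrix (not breast cancer images), and keeping the top two components is an arbitrary illustrative choice:

```python
import numpy as np

# Toy dataset standing in for image vectors: 6 samples of dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Center the data and form its covariance matrix.
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance matrix are the principal components;
# eigh returns eigenvalues in ascending order, so sort descending.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]          # keep the top-2 components

# Project onto the components (compression) and map back (reconstruction).
Z = Xc @ components                          # compressed representation
X_rec = Z @ components.T + X.mean(axis=0)    # approximate reconstruction
print(X_rec.shape)
```

The reconstruction error is whatever variance lies along the discarded eigenvectors, which is why keeping the largest-variance components compresses with the least loss.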


A representation for some groups, a geometric approach

In the present paper, we are going to use geometric and topological concepts, entities and properties of the integral curves of linear vector fields, and the theory of differential equations, to establish a representation for some groups on R^n (n ≥ 1). Among other things, we investigate the surjectivity and faithfulness of the representation. At the end, we give some app...



OxLM: A Neural Language Modelling Framework for Machine Translation

This paper presents an open source implementation of a neural language model for machine translation. Neural language models deal with the problem of data sparsity by learning distributed representations for words in a continuous vector space. The language modelling probabilities are estimated by projecting a word’s context into the same space as the word representations and by assigning probabi...




Publication date: 2000